I would like to thank the discussants for a number of deep and interesting 
comments and for their inspiring work on the subject over the years. I will 
not be able to address all the issues raised in the discussion; I will concentrate 
on just a few of them. 

1. Local complexities and excess risk bounds. The first question is about 
possible ways to define distribution- and data-dependent complexities (such 
as local Rademacher complexities). The approach taken in my paper is based 
on geometric and probabilistic properties of the $\delta$-minimal set 

$$\mathcal{F}(\delta) := \Bigl\{ f \in \mathcal{F} : Pf - \inf_{g \in \mathcal{F}} Pg \le \delta \Bigr\}$$

of the true risk function $\mathcal{F} \ni f \mapsto Pf$. The first quantity of interest is the 
L2-diameter of this set, $D(\mathcal{F};\delta)$, and the second one is the function $\phi_n(\mathcal{F};\delta)$ 
that is equal to the expected supremum of the empirical process indexed by the 
differences $f - g$, $f, g \in \mathcal{F}(\delta)$. These two functions are then combined in the 
expression $U_n(\delta;t)$ that has its roots in Talagrand's concentration inequalities 
for empirical processes. The $\sharp$-transform of $U_n(\cdot;t)$ (which is just a way 
to write solutions of fixed point-type equations) is then used to define the 
localized complexities that provide upper bounds on the excess risk. Under 
further assumptions, such as mean-variance relationships discussed in de- 
tail by Shen and Wang (Bartlett and Mendelson also discuss this and call 
the function classes satisfying these relationships "Bernstein classes"), these 
complexities can be redefined in terms of local L2-continuity modulus of 
empirical processes. Since the Rademacher process can be used as a data- 
dependent bootstrap-type "estimate" of the empirical process, this approach 
also leads to data-dependent local Rademacher complexities. The use of the 
whole $\delta$-minimal set is not the only possibility. One can also look at its 
"slices" $\mathcal{F}(\delta_1,\delta_2] := \mathcal{F}(\delta_2) \setminus \mathcal{F}(\delta_1)$ and define the excess risk bounds in terms 
of the accuracy of empirical approximation on the slices. One can even make 
the slices really thin and look at $\{f \in \mathcal{F} : Pf - \inf_{g \in \mathcal{F}} Pg = \delta\}$. This was the 
approach taken by Peter Bartlett and Shahar Mendelson. Under an additional 
(and relatively innocent) assumption that the class $\mathcal{F}$ is star-shaped, 
they established excess risk bounds (and also ratio-type bounds) in terms of 
complexities of such "thin slices." I did not take this approach in my paper 
primarily because in most of the learning theory and statistical applications 
I had in mind it is hard to take real advantage of making the slices thin and, 
on the other hand, there is a need to take care of the assumption that the 
class is star-shaped (which is a minor difficulty). Bartlett and Mendelson 
went further by defining upper and lower bounds on the excess risk in terms 
of some characteristics of complexity of function classes that are more sub- 
tle than the fixed point-type local empirical complexities. However, as they 
pointed out, there is no way to estimate such complexities (at least in their 
current form) based on the data, which makes it impossible to use them as 
complexity penalties in model selection. Another way to define more subtle 
bounds on excess risk of empirical risk minimizers is considered in Section 4 
of my paper, and the situation is somewhat similar. In that section, I try 
to develop bounds for the case when the risk function $f \mapsto Pf$ has 
multiple minima in the class $\mathcal{F}$. In my view, this is an important problem 
with potential impact on model selection methodology (see some discussion 
in Section 4). I was able to come up with a modification of the definitions of 
local complexities and to prove the corresponding excess risk bounds in this 
case, but I was unable to design a data-dependent version of such complex- 
ities. At the moment, it seems to me that more subtle definitions of local 
Rademacher complexities pose some hard problems and the definition based 
on the fixed point approach is much more practical. 
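
To make the objects in this discussion concrete, here is a minimal Python sketch for a finite class of functions. It is only an illustration under stated assumptions: the function names, the array layout `values[j][i]` and the use of an empirical stand-in for the minimal set (risks within `delta` of the smallest empirical risk) are my choices for this example, not definitions taken from the paper. The sketch estimates, by Monte Carlo over random signs, the expected supremum of the Rademacher process indexed by differences of functions in that set.

```python
import random

def empirical_minimal_set(values, delta):
    """Indices j whose empirical risk is within delta of the smallest one.

    values[j][i] holds f_j(X_i). This is an empirical stand-in (an assumption
    of this sketch) for the minimal set defined through the unknown true risk.
    """
    risks = [sum(row) / len(row) for row in values]
    best = min(risks)
    return [j for j, r in enumerate(risks) if r - best <= delta]

def local_rademacher(values, delta, n_draws=2000, seed=0):
    """Monte Carlo estimate of E sup_{f,g} |n^{-1} sum_i eps_i (f - g)(X_i)|,
    the supremum being over pairs f, g in the empirical minimal set."""
    rng = random.Random(seed)
    idx = empirical_minimal_set(values, delta)
    n = len(values[0])
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        sums = [sum(e * v for e, v in zip(eps, values[j])) / n for j in idx]
        # sup over pairs of |R(f) - R(g)| equals max R - min R
        total += max(sums) - min(sums)
    return total / n_draws
```

Evaluating `local_rademacher` on a grid of `delta` values gives a data-dependent curve of the kind that enters fixed point-type equations; the sketch does not attempt to solve such an equation.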

Another interesting line of research is related to attempts to replace the 
Rademacher process by other bootstrap-type estimates of empirical pro- 
cesses. The most natural candidate is, probably, the empirical process based 
on Efron's classical bootstrap. Unlike the Rademacher process, this method 
of estimation of the empirical process is known to be asymptotically correct (as 
was proved by Giné and Zinn). Fromont [7] has recently done some preliminary 
work in this direction and obtained for Efron's bootstrap several 
inequalities similar to earlier results on Rademacher complexities. 
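
The two bootstrap-type estimates mentioned here can be compared numerically for a finite class. The following Python sketch is illustrative only; the function names and the array layout `values[j][i]` (the value of the j-th function at the i-th observation) are assumptions of this example. It draws the supremum of the Rademacher process and the supremum of the bootstrap empirical process (Efron's resampling with replacement) over the class.

```python
import random

def sup_rademacher(values, rng):
    """sup_f |n^{-1} sum_i eps_i f(X_i)| for one draw of random signs."""
    n = len(values[0])
    eps = [rng.choice((-1, 1)) for _ in range(n)]
    return max(abs(sum(e * v for e, v in zip(eps, row))) / n for row in values)

def sup_bootstrap(values, rng):
    """sup_f |(P_n^* - P_n) f| for one Efron bootstrap resample."""
    n = len(values[0])
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1      # multinomial resampling counts
    # (P_n^* - P_n) f = n^{-1} sum_i (count_i - 1) f(X_i)
    return max(abs(sum((c - 1) * v for c, v in zip(counts, row))) / n
               for row in values)

def monte_carlo(stat, values, n_draws=1000, seed=0):
    """Average of stat(values, rng) over n_draws independent draws."""
    rng = random.Random(seed)
    return sum(stat(values, rng) for _ in range(n_draws)) / n_draws
```

One visible difference: the bootstrap process is automatically centered, so it vanishes on constant functions, while the Rademacher process does not.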

It is also important to extend excess risk bounds for empirical risk min- 
imizers to more general settings, in which the empirical risk is no longer 
the average of functions of i.i.d. random variables, but has a more com- 
plicated structure. Stephan Clemencon, Gabor Lugosi and Nicolas Vayatis 
consider an example of such a problem that is of interest in machine learning, 
the so-called "ranking" problem. In this problem, the empirical risk has a 
U-statistic structure, and concentration inequalities and exponential bounds for 
U-statistics and U-processes (and also for Rademacher chaos) play an important 
role. They successfully developed an interesting theory extending many 
of the distribution-dependent excess risk bounds to this more general framework 
(although developing data-dependent bounds remains a challenge). 
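
For readers less familiar with this setting, a small Python sketch shows why the empirical risk in ranking is an average over pairs of observations rather than over single observations. The scoring-based form of the ranking rule, the function name and the convention of counting ties as half an error are assumptions of this example, not details taken from the discussion paper.

```python
from itertools import combinations

def ranking_risk(scores, labels):
    """Empirical ranking risk as a U-statistic of order 2.

    Averages, over all unordered pairs (i, j), the indicator that the pair
    has different labels and the scores order it incorrectly. Ties in the
    scores count as half an error (a convention chosen for this sketch).
    """
    pairs = list(combinations(range(len(scores)), 2))
    loss = 0.0
    for i, j in pairs:
        if labels[i] == labels[j]:
            continue                      # equal labels: nothing to rank
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        if scores[hi] < scores[lo]:
            loss += 1.0
        elif scores[hi] == scores[lo]:
            loss += 0.5
    return loss / len(pairs)
```

Because every observation appears in many pairs, the summands are dependent, which is why tools for U-statistics and U-processes replace the usual i.i.d. arguments.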

2. Penalization and oracle inequalities. Not surprisingly, the role of com- 
plexity penalization in model selection problems of learning theory happened 
to be one of the main topics of the discussion. Gilles Blanchard and Pascal 
Massart compare in great detail penalized empirical risk minimization with 
cross-validation-type model selection techniques, primarily with hold-out 
(studied by Massart in recent years). They emphasize serious difficulties 
with penalization methods in the practice of model selection. In particular, 
both complexity penalties and oracle inequalities typically involve constants 
that are far from being optimal, which makes the method useless from the 
practical point of view. The difficulties are even more serious in classifica- 
tion where it is hard to design penalties providing adaptation to the noise 
condition and, on the other hand, there are many possible choices of loss 
functions leading to many different solutions. Similar concerns have been 
raised by Xiaotong Shen and Lifeng Wang and, to some extent, by Sara 
van de Geer. One can hardly disagree with this. However, Blanchard and 
Massart mentioned two reasons to be interested in complexity penalization 
approach. The first reason is related to the difficulties with implementing 
cross-validation for independent but not identically distributed observations. 
The second reason is the need to split the data into two or more parts, used 
for estimation and for validation of the model, which is a problem when 
the number of training examples is small and which results in reducing the 
efficiency of the method. I would like to add to this one more reason, which 
has been the most important for me. On the one hand, local Rademacher 
complexities provide a very general, abstract and essentially universal ap- 
proach to model selection in learning problems that can be formulated as 
empirical risk minimization. On the other hand, using bounds of the theory 
of empirical processes, they can be easily specialized in particular settings 
and they take many different shapes and forms depending on which complexity 
parameters are important in a specific problem. In some cases, the 
local Rademacher complexity becomes $d/n$, where $d$ is a linear dimension of 
the model; in other cases, it is $V/n$, where $V$ is the VC-dimension; or, in binary 
classification, it is $V/(nh)$, where $h$ is a positive parameter that separates 
the value of the regression function from 0; or it can depend on eigenvalues of 
the kernel in kernel machine learning, etc. Thus, at least in principle, this 
method can guide developers of learning machines by providing them with 
flexible quantitative measures of complexity of the problem that have to be 
taken into account to select a good model and to avoid overfitting. Cross- 
validation might be as good (or better) practically (and there might be nice 
theoretical justifications of this method, as Massart showed in the case of 
hold-out), but it does not help us to understand how the performance of 
the method is related to the structure and complexity of the models. Being 
a very practical approach, the cross-validation is at the same time very ab- 
stract in the sense that it does not explicitly take into account the intrinsic 
complexity of the problem. Local Rademacher complexities can also be used 
to design penalties and develop model selection strategies in a very general 
framework, but they can be easily specialized to reflect specific structures 
of a particular problem. Of course, various questions raised by Blanchard 
and Massart, such as calibration of penalties based on the data, are very 
important in future development of this method. Also, I do not think that 
constants involved in complexity penalties and in oracle inequalities will for- 
ever remain prohibitively large and that this approach to model selection 
will be only a subject of theoretical exercises. On the contrary, very serious 
progress has been made in obtaining sharp constants in Talagrand's concen- 
tration inequalities due to the work of Ledoux, Massart, Rio, Bousquet and 
others during the recent years, and this is the main probabilistic tool used 
in analysis of model selection problems of learning theory. It will probably 
take some time for excess risk bounds and oracle inequalities with sharper 
constants to be developed, but it is only a matter of time. 

Recently, Bartlett [1] (see also the discussion paper by Bartlett and Mendelson) 
made an interesting observation that if the classes $\mathcal{F}_j$ are nested in the 
sense that $\mathcal{F}_j \subset \mathcal{F}_{j+1}$, $j \ge 1$, and the corresponding excess risk bounds $\delta_n(j)$ 
satisfy the monotonicity assumption $\delta_n(j) \le \delta_n(j+1)$, $j \ge 1$, then there is 
a very simple way to prove a sharp oracle inequality for penalized empirical 
risk minimization. This result is so closely related to some of the excess risk 
bounds considered in my paper and it is so easy to prove that I cannot resist 
the temptation to prove a version of it here. I will use the notation of Section 5 
of my paper. Let $\mathcal{F} := \bigcup_{j \ge 1} \mathcal{F}_j$. 

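Lemma 1. Let 

$$\hat k := \operatorname{argmin}_{k \ge 1} \Bigl[ \inf_{f \in \mathcal{F}_k} P_n f + 4\delta_n(k) \Bigr] \quad\text{and}\quad \hat f := \hat f_{\hat k}.$$

Suppose that, on some event, for all $j \ge 1$ and all $f \in \mathcal{F}_j$, 

(2.1)    $$\mathcal{E}_P(\mathcal{F}_j; f) \le 2\mathcal{E}_{P_n}(\mathcal{F}_j; f) + \delta_n(j)$$

and 

(2.2)    $$2\mathcal{E}_P(\mathcal{F}_j; f) + \delta_n(j) \ge \mathcal{E}_{P_n}(\mathcal{F}_j; f).$$

Then, on the same event, 

$$\mathcal{E}_P(\mathcal{F}; \hat f) \le \inf_{j \ge 1} \Bigl[ \inf_{f \in \mathcal{F}_j} Pf - \inf_{f \in \mathcal{F}} Pf + 9\delta_n(j) \Bigr].$$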



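Proof. Indeed, for $j \ge \hat k$, 

$$\mathcal{E}_P(\mathcal{F}_j; \hat f) \le 2\mathcal{E}_{P_n}(\mathcal{F}_j; \hat f) + \delta_n(j) = 2\Bigl[ \inf_{\mathcal{F}_{\hat k}} P_n f - \inf_{\mathcal{F}_j} P_n f \Bigr] + \delta_n(j)$$

$$\le 2\Bigl[ \inf_{\mathcal{F}_{\hat k}} P_n f + 4\delta_n(\hat k) - \inf_{\mathcal{F}_j} P_n f - 4\delta_n(j) \Bigr] + 9\delta_n(j),$$

which is bounded by $9\delta_n(j)$ since, by the definition of $\hat k$, the term in the 
bracket is nonpositive. This implies 

$$P\hat f \le \inf_{\mathcal{F}_j} Pf + 9\delta_n(j).$$

Consider now the case $j < \hat k$ and $\delta_n(j) \ge \delta_n(\hat k)/9$. In this case we simply 
have 

$$P\hat f \le \inf_{\mathcal{F}_{\hat k}} Pf + \delta_n(\hat k) \le \inf_{\mathcal{F}_j} Pf + 9\delta_n(j)$$

[note that (2.1) implies that, for all $j$, $\mathcal{E}_P(\mathcal{F}_j; \hat f_j) \le \delta_n(j)$]. Finally, if $j < \hat k$ 
and $\delta_n(j) < \delta_n(\hat k)/9$, then the definition of $\hat k$ implies that 

$$\inf_{f \in \mathcal{F}_j} \mathcal{E}_{P_n}(\mathcal{F}_{\hat k}; f) = \inf_{\mathcal{F}_j} P_n f - \inf_{\mathcal{F}_{\hat k}} P_n f \ge 4(\delta_n(\hat k) - \delta_n(j)) \ge 3\delta_n(\hat k).$$

Therefore, 

$$2\inf_{f \in \mathcal{F}_j} \mathcal{E}_P(\mathcal{F}_{\hat k}; f) + \delta_n(\hat k) \ge \inf_{f \in \mathcal{F}_j} \mathcal{E}_{P_n}(\mathcal{F}_{\hat k}; f) \ge 3\delta_n(\hat k),$$

implying 

$$\inf_{f \in \mathcal{F}_j} \mathcal{E}_P(\mathcal{F}_{\hat k}; f) \ge \delta_n(\hat k) \ge \mathcal{E}_P(\mathcal{F}_{\hat k}; \hat f)$$

and, as a consequence, 

$$P\hat f \le \inf_{\mathcal{F}_j} Pf \le \inf_{\mathcal{F}_j} Pf + 9\delta_n(j).$$

The result now follows. □ 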
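
The selection rule analyzed in the lemma, which picks the class minimizing the smallest empirical risk plus four times the excess risk bound, is easy to state in code. The Python sketch below is only an illustration: the function name, the list-based inputs and the assumption that the bounds are already available as numbers are mine. In practice, obtaining usable values of the bounds is the hard part.

```python
def penalized_selection(empirical_min_risks, deltas, penalty_factor=4.0):
    """Index k minimizing  min_{f in F_k} P_n f + penalty_factor * delta_n(k).

    empirical_min_risks[k] is the smallest empirical risk over the k-th class
    and deltas[k] is the excess risk bound for that class (assumed given).
    The bounds are required to be nondecreasing, as in the lemma.
    """
    if any(b < a for a, b in zip(deltas, deltas[1:])):
        raise ValueError("deltas must form a nondecreasing sequence")
    scores = [r + penalty_factor * d
              for r, d in zip(empirical_min_risks, deltas)]
    return min(range(len(scores)), key=scores.__getitem__)
```

For example, with minimal empirical risks `[0.40, 0.25, 0.24, 0.239]` and bounds `[0.01, 0.02, 0.04, 0.08]`, the penalized scores are 0.44, 0.33, 0.40 and 0.559, so the second class (index 1) is selected: the small gain in empirical risk from the larger classes does not pay for their larger penalties.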



It follows from the excess risk bounds of Section 3 (see Lemma 2 and 
its proof; see also the proof of Theorem 7) that conditions (2.1) and (2.2) 
do hold on an event E of probability close to 1. To apply the lemma, one 
has to make $\delta_n(k)$ a nondecreasing sequence with respect to $k$ (as we did in 
Section 5.3). This simple fact immediately shows that not only the comparison, 
but also the penalization method of model selection will be adaptive 
to the noise conditions in classification provided that we deal with nested 
models: monotonicity simplifies the matter. However, I would like to point 
out that even when the classes $\mathcal{F}_j$ are nested, the excess risk bounds, such 
as $\delta_n(\mathcal{F}_j;t)$, do not necessarily form an increasing sequence [one can easily 
construct examples of $\mathcal{F}_1 \subset \mathcal{F}_2$ such that $\delta_n(\mathcal{F}_1;t) > \delta_n(\mathcal{F}_2;t)$]. So, it is not 
always a good idea to "monotonize" the penalties even in the case of nested 
models! 

Of course, the assumption that the classes are nested excludes many important 
examples. The simplest one is the example in which $\mathcal{F}_j = \{f_j\}$. This 
is what Blanchard and Massart deal with in their Theorem 1 that provides 
a simple justification of the hold-out method of model selection. In this case, 
the oracle inequality of the lemma does not apply and Blanchard and Mas- 
sart show a weaker form of oracle inequality that involves a constant C > 1 
in front of the approximation error term. In addition, there is a term that 
depends on the function $\psi$ (describing the relationship between the excess 
risk and the variance). 

Alexandre Tsybakov looks at model selection problems in a broader con- 
text of aggregation of statistical estimates. He conjectures that, in general, 
no aggregation procedure based on simple selection of one of N preliminary 
trained estimates achieves the optimal aggregation rate, which is known to 
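be of the order $(\log N)/n$. It is easy to provide some evidence that this conjecture 
is true in an abstract framework. Namely, let $\mathcal{F} := \{f_1, \ldots, f_N\}$ and 

$$\hat f := \operatorname{argmin}\{P_n f : f \in \mathcal{F}\}.$$

Proposition 1. (i) For any functions $f_j : S \mapsto [0,1]$, $1 \le j \le N$, 

$$\mathbb{E}\,\mathcal{E}_P(\mathcal{F}; \hat f) \le C\sqrt{\frac{\log N}{n}}$$

with some numerical constant $C > 0$. 

(ii) There exist a space $S$, a probability measure $P$ on it and functions 
$f_j : S \mapsto [0,1]$, $1 \le j \le N$, such that 

$$\mathbb{E}\,\mathcal{E}_P(\mathcal{F}; \hat f) \ge c_{n,N}\sqrt{\frac{\log N}{n}},$$

where $c_{n,N} := a - b((\log N)^{-\alpha} + n^{-1/2})$, $a, b, \alpha > 0$ being numerical constants. 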

The proof of part (i) is a straightforward application of the excess risk 
bounds in my paper and well-known bounds on expectation of the sup-norm 
of Rademacher process indexed by a finite class of functions. Part (ii) can 
be shown by a simple modification of the example of Proposition 2 in the 
paper. Namely, take $S = \{0, 1\}^N$. Let $P$ be the uniform distribution on $S$. 
Take 




Define 

$f_j(x) = (1-\delta)x_j + \delta$, $1 \le j \le N-1$, and 
$f_N(x) = (1-\delta)x_N$, $x = (x_1, \ldots, x_N) \in S$. 

The proof of (ii) is a minor modification of the proof of Proposition 2. 
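
The behavior of empirical risk minimization in this example can be explored by simulation. The Python sketch below is a hedged illustration: the function names are hypothetical and the shift `delta` is left as a free parameter, so the code does not reproduce the specific choice made in the proof. It uses the facts that, under the uniform distribution on the binary cube, the first N-1 functions have true risk (1 + delta)/2 and the last one has true risk (1 - delta)/2, so the excess risk of the selected function is either 0 or delta.

```python
import random

def erm_excess_risk(n, N, delta, rng):
    """Draw n points uniformly from {0,1}^N and return the excess risk of the
    empirical risk minimizer over f_1, ..., f_N defined as above."""
    # Coordinates are independent fair bits, so each column mean can be
    # simulated separately.
    col_means = [sum(rng.getrandbits(1) for _ in range(n)) / n
                 for _ in range(N)]
    emp = [(1 - delta) * m + delta for m in col_means[:-1]]
    emp.append((1 - delta) * col_means[-1])      # f_N has no shift by delta
    j_hat = min(range(N), key=emp.__getitem__)
    # P f_j - P f_N = delta for every j < N
    return 0.0 if j_hat == N - 1 else delta

def mean_excess_risk(n, N, delta, n_rep=200, seed=0):
    """Monte Carlo estimate of the expected excess risk of the minimizer."""
    rng = random.Random(seed)
    return sum(erm_excess_risk(n, N, delta, rng) for _ in range(n_rep)) / n_rep
```

When `delta` is large the best function is always found, while for small `delta` and large `N` some other function often has a smaller empirical risk purely by chance, which is the mechanism behind the lower bound.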

As an alternative to a simple model selection, Tsybakov advocates using 
convex mixtures of preliminary estimates with data-dependent weights that 
allow one to achieve the optimal MS-aggregation rate of the order $(\log N)/n$. 
He discusses several interesting approaches (in particular, mirror descent 
method) to model selection and convex aggregation in problems of risk min- 
imization with convex loss, such as regression and large margin classification, 
and poses some interesting open problems concerning excess risk bounds for 
such aggregation procedures. 
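
As a small illustration of what a convex mixture with data-dependent weights looks like, the Python sketch below uses exponential weighting of cumulative losses. This is one standard weighting scheme chosen for the example; it is not claimed to be the specific procedure analyzed by Tsybakov, and the function names and the `temperature` parameter are assumptions of the sketch.

```python
import math

def exponential_weights(cum_losses, temperature):
    """Convex weights w_j proportional to exp(-cum_losses[j] / temperature)."""
    m = min(cum_losses)                    # subtract the minimum for stability
    raw = [math.exp(-(c - m) / temperature) for c in cum_losses]
    z = sum(raw)
    return [r / z for r in raw]

def aggregate(predictions, weights):
    """Convex mixture sum_j w_j f_j(x) of the preliminary estimates at one x."""
    return sum(w * p for w, p in zip(weights, predictions))
```

As `temperature` tends to zero the weights concentrate on the estimate with the smallest cumulative loss, so simple selection appears as a limiting case of this family of mixtures.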

Empirical risk minimization with convex loss function is, probably, the 
most popular approach to the development of learning algorithms, in par- 
ticular, in regression and classification. Sara van de Geer suggested a way 
to extend some of the excess risk bounds considered in my paper to the 
case of possibly unbounded convex losses. She also made some interesting 
observations about the role of excess risk bounds and of "noise" or "margin" 
behavior of the models in model selection problems. At the end, she briefly 
mentioned $\ell_1$-penalization as a promising approach to model selection, and 
I would like to comment a little more on this since, in my view, it might be 
an area where very important developments in learning theory are about to 
take place. 

3. $\ell_p$-penalties and sparsity. Many learning problems (in particular, optimal 
aggregation of regression estimates or of classifiers) can be studied in 
the following framework. Let $\mathcal{H} := \{h_1, \ldots, h_N\}$ be a large set of functions 
from $S$ into $[-1, 1]$. For instance, $\mathcal{H}$ can be a large dictionary consisting of $N$ 
atoms and used to represent functions as linear combinations of the atoms, 
or it can be a set of pretrained estimates in regression or classification, or it 
can be a set of features characterizing an image. For $\lambda \in \mathbb{R}^N$, denote 

$$f_\lambda := \sum_{j=1}^N \lambda_j h_j, \qquad \lambda = (\lambda_1, \ldots, \lambda_N) \in \mathbb{R}^N.$$

Often, learning problems can be formulated as risk minimization 

$$\lambda^0 := \operatorname{argmin}_{\lambda \in \mathbb{R}^N} P(\ell \bullet f_\lambda)$$

with a convex loss function $\ell$ (see Section 7 of my paper). Since the distribution 
$P$ is unknown, the risk $P(\ell \bullet f_\lambda)$ has to be replaced by its empirical 
version $P_n(\ell \bullet f_\lambda)$ and, when $N$ is very large, there is a need to penalize 
it for complexity in order to avoid overfitting. This leads to the following 
penalized ERM problem: 

(3.1)    $$\hat\lambda^\varepsilon := \operatorname{argmin}_{\lambda \in \mathbb{R}^N} [P_n(\ell \bullet f_\lambda) + \varepsilon\,\mathrm{pen}(\lambda)],$$

where $\varepsilon > 0$ is a regularization parameter and pen is a complexity penalty 
defined on $\mathbb{R}^N$. It is to be compared with the problem of penalized true risk 
minimization 

(3.2)    $$\lambda^\varepsilon := \operatorname{argmin}_{\lambda \in \mathbb{R}^N} [P(\ell \bullet f_\lambda) + \varepsilon\,\mathrm{pen}(\lambda)].$$

Imagine now that the solution $\lambda^\varepsilon$ of the true risk minimization problem is 
"sparse" in the sense that most of the components of the vector $\lambda^\varepsilon$ are equal 
to zero (or, at least, they are very small). The question is then whether it 
is possible to find complexity penalties that would allow us to recover the 
sparse solution with a reasonable accuracy. One obvious choice is 

$$\mathrm{pen}(\lambda) := \operatorname{card}\{j : \lambda_j \ne 0\}.$$

This corresponds to "hard thresholding" frequently used in signal processing 
and nonparametric statistics. It is relatively easy to analyze the resulting 
penalized ERM problem using the techniques of my paper and to obtain 
reasonable bounds on the excess risk $P(\ell \bullet f_{\hat\lambda^\varepsilon}) - P(\ell \bullet f_{\lambda^0})$. However, with this 
choice of penalty, the penalized ERM problem is computationally intractable 
and, as an alternative, the $\ell_1$-penalty 

$$\mathrm{pen}(\lambda) := \sum_{j=1}^N |\lambda_j|$$

has been frequently used. The resulting optimization problem is convex and 
it is computationally tractable. This approach is close to what is called 
"soft thresholding" in nonparametric statistics and LASSO in regression. 
Similar algorithms are known in signal processing and computational har- 
monic analysis (basis pursuit). There has been very interesting recent work 
on mathematical justification of this approach in several settings (see [2, 3, 
4, 5, 6, 10, 11]). It was shown that in many cases the minimization of the $\ell_1$-norm 
leads to the recovery of the sparsest solution of the problem. However, 
the study of sparsity properties of the solution $\hat\lambda^\varepsilon$ of (3.1) with $\ell_1$-penalty 
remains a challenge when this problem is considered in full generality (for 
general convex loss functions $\ell$ and without restrictive assumptions that 
functions $h_j$ are almost orthogonal). Recently, I looked at this problem with 

$$\mathrm{pen}(\lambda) := \|\lambda\|_{\ell_p}^p = \sum_{j=1}^N |\lambda_j|^p$$

for $p = 1 + \frac{1}{\log N}$ [8, 9]. For such a value of $p$, the $\ell_p$-norm is within a numerical 
constant of the $\ell_1$-norm (so, in some sense, such a penalization is equivalent 
to the $\ell_1$-penalization). On the other hand, the penalty is strictly convex, 
which is an advantage in the analysis of the problem. In this setting, it was 
possible to prove (under somewhat restrictive assumptions on the loss) that 
"approximate sparsity" of $\lambda^\varepsilon$ leads to "approximate sparsity" of $\hat\lambda^\varepsilon$. More 
precisely, for $\lambda = (\lambda_1, \ldots, \lambda_N)$, define its sparsity function as 

$$\gamma_d(\lambda) := \sum_{j=d+1}^N |\lambda_{[j]}|,$$

where $|\lambda_{[1]}| \ge |\lambda_{[2]}| \ge \cdots$ is a decreasing rearrangement of the components 
of $\lambda$. Then, for some constants $D$ depending only on $\ell$ and $K$ depending on 
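$\varepsilon$ and $\|\lambda^\varepsilon\|_{\ell_1}$, for all $A \ge 1$ and for $\varepsilon \ge D\sqrt{\frac{d + A\log N}{n}}$, the condition $\gamma_d(\lambda^\varepsilon) = 0$ 
implies that with probability at least $1 - N^{-A}$ 

$$\gamma_d(\hat\lambda^\varepsilon) \le K\sqrt{\frac{d + A\log N}{n}}.$$

Moreover, for $\varepsilon \ge D\log N\sqrt{\frac{d + A\log N}{n}}$, without any assumption on $\gamma_d(\lambda^\varepsilon)$, 

$$\gamma_d(\lambda^\varepsilon) \le C\gamma_d(\hat\lambda^\varepsilon) + K\sqrt{\frac{d + A\log N}{n}}$$

and 

$$\gamma_d(\hat\lambda^\varepsilon) \le C\gamma_d(\lambda^\varepsilon) + K\sqrt{\frac{d + A\log N}{n}},$$

where $C > 0$ is a numerical constant. These sparsity bounds are true with no 
restriction on the functions $h_j$. Under the further restriction that $\{h_j, j \in J^*\}$ are 
linearly independent, where $J^*$ is a set with $\mathrm{card}(J^*) =: d^*$ such that $\lambda_j^\varepsilon = 0$ 
for $j \notin J^*$ and $\varepsilon > 0$, the sparsity bounds lead to the bounds on excess risk 
$P(\ell \bullet f_{\hat\lambda^\varepsilon}) - P(\ell \bullet f_{\lambda^0})$ of the order [...] and on the $\ell_1$-norm $\|\hat\lambda^\varepsilon - \lambda^0\|_{\ell_1}$ of the 
order [...] (up to $\log N$-factors). 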
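
To connect the penalized problem with computation, here is a minimal Python sketch of the simplest special case: squared loss with the plain absolute-value penalty, solved by proximal gradient steps with soft thresholding. Everything specific in it is an assumption made for illustration: the squared loss, the solver (a standard iterative scheme, not a method from the papers cited above), the function names, and the reading of the sparsity function as the sum of all but the d largest absolute components.

```python
def soft_threshold(x, t):
    """Proximal map of t * |.|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def sparsity_function(lam, d):
    """Sum of all but the d largest absolute components of lam."""
    return sum(sorted((abs(v) for v in lam), reverse=True)[d:])

def l1_penalized_erm(H, y, eps, step, n_iter=500):
    """Proximal gradient iterations for
        (1/n) sum_i (y_i - f_lam(X_i))^2 + eps * sum_j |lam_j|,
    where H[i][j] = h_j(X_i) and f_lam = sum_j lam_j h_j.
    The step size must not exceed the inverse Lipschitz constant of the
    gradient of the quadratic part."""
    n, N = len(H), len(H[0])
    lam = [0.0] * N
    for _ in range(n_iter):
        resid = [sum(H[i][j] * lam[j] for j in range(N)) - y[i]
                 for i in range(n)]
        grad = [2.0 / n * sum(H[i][j] * resid[i] for i in range(n))
                for j in range(N)]
        lam = [soft_threshold(l - step * g, step * eps)
               for l, g in zip(lam, grad)]
    return lam
```

In the orthogonal toy case `H = [[1, 0], [0, 1]]`, `y = [1.0, 0.1]`, `eps = 0.3`, the iterations converge to approximately `[0.7, 0.0]`: the large coefficient is shrunk by `eps` and the small one is set exactly to zero, which is the soft thresholding effect described above.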

From the point of view of learning theory, the linear dimension d involved 
in the definition of the sparsity is only the simplest way to measure the 
complexity of function classes. There are many other notions of complexity 
that are involved in the excess risk bounds discussed in my paper. It would 
be really interesting to find ways to describe the sparsity phenomenon in 
learning problems for various classes of learning machines (boosting, kernel 
machines, etc.) where other measures of complexity are relevant and develop 
penalization techniques that guarantee some degree of sparsity of empirical 
solutions provided that true solutions are sparse. 